Text mining and software engineering: an integrated source code and document analysis approach

نویسندگان

  • René Witte
  • Qiangqiang Li
  • Yonggang Zhang
  • Juergen Rilling
چکیده

Documents written in natural languages constitute a major part of the artifacts produced during the software engineering lifecycle. Especially during software maintenance or reverse engineering, semantic information conveyed in these documents can provide important knowledge for the software engineer. In this paper, we present a text mining system capable of populating a software ontology with information detected in documents. A particular novelty is the integration of results from automated source code analysis into an NLP pipeline, allowing to cross-link software artifacts represented in code and natural language on a semantic level.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A survey on Automatic Text Summarization

Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to decipher through such massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...

متن کامل

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...

متن کامل

DiffLDA: Topic Evolution in Software Projects

Previous research has shown that topics can be automatically discovered in a software project’s source code. Topics are collections of words that co-occur frequently in a text collection and are discovered using topic models such as latent Dirichlet allocation (LDA). Tracking how topics evolve, i.e., grow and spread, over time is useful for supporting software maintenance, comprehension, and re...

متن کامل

Rework Effort Estimation of Self-admitted Technical Debt

Programmers sometimes leave incomplete, temporary workarounds and buggy codes that require rework. This phenomenon in software development is referred to as Selfadmitted Technical Debt (SATD). The challenge therefore is for software engineering researchers and practitioners to resolve the SATD problem to improve the software quality. We performed an exploratory study using a text mining approac...

متن کامل

Text Mining Through Semi Automatic Semantic Annotation

The Web is the greatest information source in human history. Unfortunately, mining knowledge out of this source is a laborious and errorprone task. Many researchers believe that a solution to the problem can be founded on semantic annotations that need to be inserted in web-based documents and guide information extraction and knowledge mining. In this paper, we further elaborate a tool-supporte...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • IET Software

دوره 2  شماره 

صفحات  -

تاریخ انتشار 2008